
Leaderboard scaffold: standings JSON + Markov reference + verify CLI #4

Closed

protosphinx wants to merge 1 commit into dataset-fetch from leaderboard-scaffold

Conversation

@protosphinx
Member

Stacked on top of #3 (v0.1 fetch). Merge #2 → #3 → this, in order.

Summary

  • First standings file lands: leaderboard/next-event/synthetic-toy.json, with the Markov reference baseline as the inaugural entry (top-1 0.9756, top-3 1.0, n=41); a plausible schema is sketched after this list.
  • pm-bench leaderboard <task> <dataset> [--verify] pretty-prints the table and, with --verify, re-runs scoring on the checked-in predictions to catch drift.
  • Reference predictions are in-repo (leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz) so the loop is reproducible offline.
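
For concreteness, here is one plausible shape for synthetic-toy.json. Only the metric values, the entry name, and the predictions path are confirmed by this PR; the field names and nesting are illustrative guesses, not the actual schema:

```json
{
  "task": "next-event",
  "dataset": "synthetic-toy",
  "metrics": ["top1", "top3"],
  "entries": [
    {
      "name": "markov-ref",
      "top1": 0.9756,
      "top3": 1.0,
      "n": 41,
      "predictions": "leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz"
    }
  ]
}
```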

What's new

  • pm_bench/leaderboard.py: load_board, rescore, verify, standings. Pure CPython, reads gzipped or plain CSV. Truth dispatch is keyed on dataset name; today only synthetic-toy is wired (the dispatch grows a branch per pinned dataset). A sketch of the verify path follows this list.
  • CLI: pm-bench leaderboard <task> <dataset> prints standings; --verify fails non-zero if recorded scores don't match a fresh rescore.
  • leaderboard/README.md — submission convention; how to verify locally.
  • tests/test_leaderboard.py — 8 tests, including a drift canary that tampers with top1 in a tmp copy of the JSON and asserts verify flags it.
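
A minimal sketch of how these pieces could fit together, assuming the JSON shape above and an illustrative predictions-CSV layout (case_id plus three ranked guesses). Only the names rescore and verify come from the PR; every helper, column name, signature, and tolerance below is an assumption, not the actual implementation:

```python
import csv
import gzip
import json
from pathlib import Path


def _open_csv(path: Path):
    """Open a gzipped or plain CSV transparently (the module accepts both)."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", newline="")
    return open(path, newline="")


def rescore(pred_path: Path, truth: dict) -> dict:
    """Recompute top-1/top-3 accuracy from a predictions file.

    Assumed layout: case_id, pred1, pred2, pred3 (ranked guesses);
    the real columns in markov-ref.csv.gz may differ.
    """
    hits1 = hits3 = n = 0
    with _open_csv(pred_path) as fh:
        for row in csv.DictReader(fh):
            ranked = [row["pred1"], row["pred2"], row["pred3"]]
            actual = truth[row["case_id"]]
            n += 1
            hits1 += ranked[0] == actual
            hits3 += actual in ranked
    return {"top1": hits1 / n, "top3": hits3 / n, "n": n}


def verify(board_path: Path, truth: dict) -> bool:
    """Re-score each entry's checked-in predictions; flag drift
    against the recorded numbers."""
    board = json.loads(board_path.read_text())
    ok = True
    for entry in board["entries"]:  # assumed schema, per the JSON sketch above
        fresh = rescore(Path(entry["predictions"]), truth)
        for metric in ("top1", "top3"):
            if abs(fresh[metric] - entry[metric]) > 1e-4:
                print(f"drift in {entry['name']}: {metric} "
                      f"recorded={entry[metric]} actual={fresh[metric]:.4f}")
                ok = False
    return ok
```

The CLI's --verify mode would then exit non-zero whenever verify() returns False, which is what lets a future CI job gate PRs on it.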

Why this matters

  • Locks the leaderboard JSON schema before any external submission lands.
  • Makes the Markov number the explicit floor on the leaderboard, not just a number in a README.
  • Sets up v0.4's CI workflow as a one-step follow-up: it just runs pm-bench leaderboard --verify on the changed files.

Smoke

$ pm-bench leaderboard next-event synthetic-toy --verify
verified 1 entr(ies) — no drift
next-event · synthetic-toy · top1 / top3 accuracy
----------------------------------------
markov-ref  top1=0.9756  top3=1.0000  n=41

Test plan

Roadmap impact

  • README v0.4 milestone marked 🟡 (scaffold + verify shipped; CI workflow on PRs is the remaining piece).

Commit message:

- leaderboard/next-event/synthetic-toy.json — first standings file,
  with the Markov-ref entry (top1 0.9756, top3 1.0, n 41)
- leaderboard/predictions/next-event/synthetic-toy/markov-ref.csv.gz —
  reference predictions, checked in so the loop is reproducible
  without hitting the network
- pm_bench/leaderboard.py — load_board, rescore, verify, standings.
  Reads gzipped or plain CSV; pure CPython (no torch / pandas)
- CLI: `pm-bench leaderboard <task> <dataset> [--verify]` —
  pretty-prints standings, optionally re-runs scoring against the
  checked-in predictions and fails if recorded != actual
- tests/test_leaderboard.py — 8 tests including a drift-detection
  canary that tampers with the recorded score and confirms verify()
  flags it (sketched after this list)
- 45 tests total (was 37); ruff clean
- README v0.4 milestone marked partial; STATUS + GOALS updated
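
The drift canary could look roughly like this. The test name, the synthetic_toy_truth fixture, and verify's two-argument signature (matching the sketch above) are all assumptions; only the tamper-a-tmp-copy behavior is stated in the PR:

```python
import json
import shutil

from pm_bench.leaderboard import verify  # real module; 2-arg signature is this sketch's assumption


def test_verify_flags_tampered_top1(tmp_path, synthetic_toy_truth):
    """Drift canary per the PR: tamper with top1 in a tmp copy of the
    standings JSON and assert verify() reports the mismatch."""
    board = tmp_path / "synthetic-toy.json"
    shutil.copy("leaderboard/next-event/synthetic-toy.json", board)

    data = json.loads(board.read_text())
    data["entries"][0]["top1"] = 0.5  # corrupt the recorded score
    board.write_text(json.dumps(data))

    # synthetic_toy_truth: assumed pytest fixture yielding the ground-truth map
    assert verify(board, synthetic_toy_truth) is False
```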
@protosphinx
Member Author

Merged into main as part of the audit-cleanup stack (commit 9c00b47). The full content of this PR is now on main.

@protosphinx protosphinx deleted the branch dataset-fetch May 1, 2026 17:54
@protosphinx protosphinx closed this May 1, 2026
@protosphinx protosphinx deleted the leaderboard-scaffold branch May 1, 2026 17:54